Classification is the process which finds a model to categorize data into unique groups. Regression models the data into a continuous function between two or more variables. The main difference between classification and regression is how the data are treated. In classification, the data are treated as categorical. In regression, the data are treated as continuous. Both methods use the data to create a model, but the models are different in nature.
Load the data.
adultdata <- read.csv("~/Desktop/adultdata.txt")
Data descriptions (as reported for the 1990 census):
age is the age of an individual; integer.workclass is the work class of an individual; string (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.)fnlwgt is unknwon as there is not enough information to describe the variable; integer.education is the completed education of an individual; string (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.)education.num is the cooresponding number to the completed education of an individual; integer.marital.status is the marital status of an individual; string (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.)occupation is the occupation of an individual; string (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.)relationship is the relationship of the individual to a specific household; string (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.)race is the race of an individual; string (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.)sex is the sex of an individual; string (Female, Male.)capital.gain is the capital gain of an individual; integer.capital.loss is the capital loss of an individual; integer.hours.per.week is the number of hours worked per week by an individual; integer.native.country is the native country of an individual; string (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.)income is the income class of an individual; string (<=50K, >50K.)adultdata[adultdata == " ?"] <- NA
apply(is.na(adultdata),2,sum)/nrow(adultdata)*100
## age workclass fnlwgt education education.num
## 0.000000 5.638647 0.000000 0.000000 0.000000
## marital.status occupation relationship race sex
## 0.000000 5.660146 0.000000 0.000000 0.000000
## capital.gain capital.loss hours.per.week native.country income
## 0.000000 0.000000 0.000000 1.790486 0.000000
From the code above…
workclass has 5.638647% missing valuesoccupation has 5.660146% missing values.native.country has 1.790486% missing values.The histogram below shows the distribution of missing values for each row in the data set.
hist(apply(is.na(adultdata),1,sum),
xlab = "Number of Missing Values Per Row",
ylab = "Frequency",
main = "Number of Missing Values",
breaks = 3,
col = "blue")
The following variables are numeric:
agefnlwgteducation.numcaptial.gaincapital.losshours.per.weekThe following variables are categorical:
workclasseducationmarital.statusoccupationrelationshipracesexnative.countryincomeSplit these variables into partitioned sets for future use.
adultnumeric <- adultdata[,c(1,3,5,11,12,13)]
adultcat <- adultdata[,-c(1,3,5,11,12,13)]
Create a histogram for each numeric variable with length(x) bins if length(x) < 100, else with 100 bins.
apply(adultnumeric, 2, function(x) ifelse(length(unique(x)) < 100,
hist(x,
breaks = length(unique(x)),
main = "",
xlab = ""),
hist(x,
breaks = 100,
main = "",
xlab = "")))
## $age
## $age[[1]]
## [1] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
## [24] 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
## [47] 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [70] 86 87 88 89 90
##
##
## $fnlwgt
## $fnlwgt[[1]]
## [1] 0 20000 40000 60000 80000 100000 120000 140000
## [9] 160000 180000 200000 220000 240000 260000 280000 300000
## [17] 320000 340000 360000 380000 400000 420000 440000 460000
## [25] 480000 500000 520000 540000 560000 580000 600000 620000
## [33] 640000 660000 680000 700000 720000 740000 760000 780000
## [41] 800000 820000 840000 860000 880000 900000 920000 940000
## [49] 960000 980000 1000000 1020000 1040000 1060000 1080000 1100000
## [57] 1120000 1140000 1160000 1180000 1200000 1220000 1240000 1260000
## [65] 1280000 1300000 1320000 1340000 1360000 1380000 1400000 1420000
## [73] 1440000 1460000 1480000 1500000
##
##
## $education.num
## $education.num[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
##
##
## $capital.gain
## $capital.gain[[1]]
## [1] 0 1000 2000 3000 4000 5000 6000 7000 8000 9000
## [11] 10000 11000 12000 13000 14000 15000 16000 17000 18000 19000
## [21] 20000 21000 22000 23000 24000 25000 26000 27000 28000 29000
## [31] 30000 31000 32000 33000 34000 35000 36000 37000 38000 39000
## [41] 40000 41000 42000 43000 44000 45000 46000 47000 48000 49000
## [51] 50000 51000 52000 53000 54000 55000 56000 57000 58000 59000
## [61] 60000 61000 62000 63000 64000 65000 66000 67000 68000 69000
## [71] 70000 71000 72000 73000 74000 75000 76000 77000 78000 79000
## [81] 80000 81000 82000 83000 84000 85000 86000 87000 88000 89000
## [91] 90000 91000 92000 93000 94000 95000 96000 97000 98000 99000
## [101] 100000
##
##
## $capital.loss
## $capital.loss[[1]]
## [1] 0 50 100 150 200 250 300 350 400 450 500 550 600 650
## [15] 700 750 800 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350
## [29] 1400 1450 1500 1550 1600 1650 1700 1750 1800 1850 1900 1950 2000 2050
## [43] 2100 2150 2200 2250 2300 2350 2400 2450 2500 2550 2600 2650 2700 2750
## [57] 2800 2850 2900 2950 3000 3050 3100 3150 3200 3250 3300 3350 3400 3450
## [71] 3500 3550 3600 3650 3700 3750 3800 3850 3900 3950 4000 4050 4100 4150
## [85] 4200 4250 4300 4350 4400
##
##
## $hours.per.week
## $hours.per.week[[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
## [70] 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
## [93] 93 94 95 96 97 98 99
Create two side-by-side histograms for each numeric variables. One historgram (blue) is the oberservations where income = "<=50K". The other (red) is the observations where income = ">50K".
lesser <- adultdata[which(adultdata$income == " <=50K"), ]
greater <- adultdata[which(adultdata$income == " >50K"), ]
for(i in c(1,3,5,11,12,13)){
par(mfrow=c(1,2))
hist(lesser[,i],
xlab = names(lesser)[i],
ylab = "Frequency",
main = "",
col = "blue")
hist(greater[,i],
xlab = names(greater)[i],
ylab = "Frequency",
main = "",
col = "red")
}
Create two side-by-side boxplots. The color coding is the same.
for(i in c(1,3,5,11,12,13)){
par(mfrow=c(1,2))
boxplot(lesser[,i],
xlab = names(lesser)[i],
ylab = "Frequency",
main = "",
col = "blue")
boxplot(greater[,i],
xlab = names(greater)[i],
ylab = "Frequency",
main = "",
col = "red")
}
par(mfrow = c(1,1))
Based on the histograms and box plots, the data are similarly distributed regardless of income class. There appears to be a greater number of higher income individuals with higher education. Higher income individuals also have a higher chance of working more hours per week.
Create a bar plot for each variable representing the frequency of each unique value.
for(i in c(2,4,6,7,8,9,10,14,15)){
barplot(table(adultdata[,i]),
las = 2,
cex.names = 0.65,
main = "",
xlab = names(adultcat)[i],
col = "blue")
}
Create two side-by-side bar charts for each categorical variables. One bar chart (blue) is the oberservations where income = "<=50K". The other (red) is the observations where income = ">50K".
for(i in c(2,4,6,7,8,9,10,14,15)){
par(mfrow=c(1,2))
barplot(table(lesser[,i]),
las = 2,
cex.names = 0.65,
main = "",
xlab = names(lesser)[i],
col = "blue")
barplot(table(greater[,i]),
las = 2,
cex.names = 0.65,
main = "",
xlab = names(greater)[i],
col = "red")
}
par(mfrow = c(1,1))
One thing that the data shows is that more higher income individuals have a masters degree than lower income invididuals. Less higher income individuals are divorsed than lower income individuals.
plot(y = adultdata$education.num, x = adultdata$age,
xlab = "Age",
ylab = "Education",
main = "Education vs. Age")
cor(adultdata$education.num, adultdata$age)
## [1] 0.03652719
There is almost no relationship between education and age. The correlation is near zero, implying the lack of relationship. The plot shows no intuitive pattern(s) either.
plot(x = adultnumeric$capital.gain, y = adultnumeric$hours.per.week,
xlab = "Capital Gain",
ylab = "Hours Per Week",
main = "Capital Gain vs. Hours Per Week")
cor(adultnumeric$capital.gain, adultnumeric$hours.per.week)
## [1] 0.07840862
There is no relationship between capital gain and hours worked per week. The correlation value between these two variables is almost zero, implying no relationship between the two variables.
Load the data.
nfl.passing.2014 <- read.csv("~/Google Drive/Academic/Spring 2016/CS 4821/nfl-passing-2014.csv")
summary(nfl.passing.2014$TD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 2.00 8.17 12.75 39.00
names(sort(-table(nfl.passing.2014$TD)))[1] # mode
## [1] "0"
The mean of TD is 8.17, the median is 2.00, and the mode is 0.
summary(nfl.passing.2014$Int)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 2.0 4.5 9.0 18.0
names(sort(-table(nfl.passing.2014$Int)))[1] # mode
## [1] "0"
The mean of Int is 4.5, the median is 2.00, and the mode is 0.
quantile(nfl.passing.2014$Yds)
## 0% 25% 50% 75% 100%
## 0.00 19.25 274.50 2653.25 4671.00
The lower and upper quantiles for YDS are 19.25 and 2653.25, respectively.
quantile(nfl.passing.2014$Rate)
## 0% 25% 50% 75% 100%
## 27.90 55.45 84.15 96.40 158.30
The lower and upper quantiles for RATE are 55.45 and 96.40, respectively.
fivenum(nfl.passing.2014$Cmp.)
## [1] 0.0 50.0 60.4 67.0 100.0
The five number summary of CMP. is above.
fivenum(nfl.passing.2014$Yds)
## [1] 0.0 19.0 274.5 2710.0 4671.0
The five number summary of YDS is above.
Create a histogram of YARDS with 4, 8, and 12 bins respectively.
hist(nfl.passing.2014$Yds,
xlab = "Passing Yards",
main = "Histogram of Passing Yards",
col = "blue",
breaks = 4)
hist(nfl.passing.2014$Yds,
xlab = "Passing Yards",
main = "Histogram of Passing Yards",
col = "blue",
breaks = 8)
hist(nfl.passing.2014$Yds,
xlab = "Passing Yards",
main = "Histogram of Passing Yards",
col = "blue",
breaks = 12)
Something strange is happening with these plots. Despite setting each break value to 4, 8, and 12, respectively, they do not appear to be binning correctly.
Compare the distribution of the of the number of YARDS of each quarter back based on whether they threw less than 12 interceptions.
lessthan12 <- nfl.passing.2014[which(nfl.passing.2014$Int < 12), ]
hist(lessthan12$Yds,
xlab = "Passing Yards",
main = "Distribution of Passing Yards")
Draw a scatter plot of TD vs. INT.
plot(x = nfl.passing.2014$TD,
y = nfl.passing.2014$Int,
xlab = "Passing Touchdowns",
ylab = "Passing Interceptions",
main = "Passing Touchdowns vs. Interceptions")
Load the data.
tennis <- read.csv("~/Google Drive/Academic/Spring 2016/CS 4821/2014w.csv")
List top 5 players who have played a minimum of 20 matches with the highest average number of games won in a match.
twenty <- tennis[tennis$Winner %in% names(subset(table(tennis$Winner), table(tennis$Winner) >= 20)), ]
twenty <- twenty[, c(10, 16,18,20)]
twenty$avg <- rowMeans(twenty[,c(2,3,4)], na.rm = TRUE)
head(twenty[order(-twenty$avg),])
## Winner W1 W2 W3 avg
## 1421 Pliskova Ka. 6 6 10 7.333333
## 12 Ivanovic A. 7 7 NA 7.000000
## 129 Kerber A. 7 7 NA 7.000000
## 168 Keys M. 6 6 9 7.000000
## 338 Peng S. 7 7 NA 7.000000
## 645 Garcia C. 7 7 NA 7.000000
From the code above, the top 5 players are:
Load the data.
wc.data.2014 <- read.csv("~/Google Drive/Academic/Spring 2016/CS 4821/wc-data-2014.csv")
groupStage <- wc.data.2014[1:48,]
groupStage$home_dif <- groupStage$home_score - groupStage$away_score
groupStage$away_dif <- groupStage$away_score - groupStage$home_score
sort(tapply(groupStage$home_dif, groupStage$home, FUN=sum) +
tapply(groupStage$away_dif, groupStage$away, FUN=sum),
decreasing = TRUE)
## Columbia Netherlands France Brazil Germany Argentina
## 7 7 6 5 5 3
## Belgium Costa Rica Mexico Chile Algeria Switzerland
## 3 3 3 2 1 1
## Bosnia Croatia Ecuador Nigeria Uruguay USA
## 0 0 0 0 0 0
## Italy Ivory Coast Russia England Ghana Greece
## -1 -1 -1 -2 -2 -2
## Iran Korea Portugal Spain Japan Australia
## -3 -3 -3 -3 -4 -6
## Honduras Cameroon
## -7 -8
From the code above, Columbia and the Netherlands have the best differential.
sort(tapply(groupStage$home_dif, groupStage$home_continent, FUN=sum) +
tapply(groupStage$away_dif, groupStage$away_continent, FUN=sum),
decreasing = TRUE)
## South America Europe CONCACAF Africa Asia
## 17 10 -1 -10 -16
From the code above, South America has the best differential.